Spotify is the largest on-demand music service provider, in large part due to its adoption of new technologies and application of big data. Spotify's competitive advantage is its ability to turn streaming into a personalized experience, powered by machine learning and natural language processing. Spotify uses recommendation systems to engage and retain listeners, increase customer satisfaction, and grow revenue.
Music data gives insight into a listener's psyche, and its use cases keep growing: it's used to curate targeted marketing campaigns, plan artist tour routes, and track music tastes. Spotify's data collection has been referred to as both "emotional surveillance" and "music intelligence".
The data science objective of this project is to understand the listener's profile. We'll analyze the mood of user playlists and recommend songs using a content-based filtering system.
In this analysis, we'll focus on several key questions: what do the audio features of each playlist look like, which listeners have similar tastes, and what new songs can we recommend to each of them?
!pip install spotipy
!pip install dash
!pip install jupyter_dash
from dash import Dash, dcc, html, Input, Output
from jupyter_dash import JupyterDash
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
We'll collect our data through the Spotify API. Using a client ID and secret key acquired through Spotify's developer dashboard, we can access playlist data from links sent to us by friends.
# In practice, load credentials from environment variables rather than hardcoding them
cid = '4819f4999fb246ee975d7bccacc5162b'
secret = '41dce8cfa93041bc94b04454c9b5ed8b'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
stored_playlists = {
    'huseinH_playlist': 'https://open.spotify.com/playlist/787g5agmc3mkhMP2yuuUq4?si=I9XqOCjHRpeTp0OATzR1pA',
    'cristinC_playlist': 'https://open.spotify.com/playlist/1hJGuIjNJ5FqnzpHPLhiBq?si=FyEfvFXiRSetkuAKyyF13Q',
    'haleyS_playlist': 'https://open.spotify.com/playlist/3fyXTn20ryd5Qi8tVgbrQE?si=l5Ke7OWiRpGbUONlWdj86Q',
    'samH_playlist': 'https://open.spotify.com/playlist/7uT4eRwuzhWRoH5pySFvJs?si=HuIZ_hmGTqaC5SsEdcTHRg',
    'jennaS_playlist': 'https://open.spotify.com/playlist/0NPdcEC2d23aRUGVX3lWg4?si=b-iboKJyQB6DEuWQ7aAu7A',
    'natalieH_playlist': 'https://open.spotify.com/playlist/6RBxXZ5qbNQ1uLoVAa7NbX?si=IWr68P2DRziv_lzODVNTAg',
    'mehulG_playlist': 'https://open.spotify.com/playlist/1UnaV6jlGUbOcQfAy1sX5t?si=H4Ard-QHQ4u4t7VzUQoHYg',
    'architM_playlist': 'https://open.spotify.com/playlist/1CJ0iKHy0CsEwwgIBlFEGQ?si=H5iLkQJlRKe7KzBRbUcVKw',
    'callyL_playlist': 'https://open.spotify.com/playlist/1vYuLAvb7R0OLJrEtcyDLD?si=lyyPqXKoS422fMLsm8DYNQ',
    'top_songs_global': 'https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF?si=1333723a6eff4b7f',
    'get_turnt': 'https://open.spotify.com/playlist/37i9dQZF1DWY4xHQp97fN6',
    'mellow_bars': 'https://open.spotify.com/playlist/37i9dQZF1DWT6MhXz0jw61',
    'new_music_friday': 'https://open.spotify.com/playlist/37i9dQZF1DX4JAvHpjipBk',
    'rap_caviar': 'https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd',
    'pop_all_day': 'https://open.spotify.com/playlist/37i9dQZF1DXarRysLJmuju',
    'songs_to_sing_in_the_car': 'https://open.spotify.com/playlist/37i9dQZF1DWWMOmoXKqHTD',
    'level_up': 'https://open.spotify.com/playlist/5Ea3GbZtAjQ4wEHfnbH3Bn?si=8PxpPcgYT5qS5sYo-IDlJQ',
}
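Spotipy accepts these full share links directly, but if we ever need the bare playlist ID, e.g. to deduplicate links that differ only in their `?si=` tracking parameter, a small helper like the hypothetical `playlist_id` below (not part of spotipy) is enough:

```python
def playlist_id(url):
    """Extract the bare playlist ID from a Spotify share link.

    Share links embed the ID between 'playlist/' and an optional
    '?si=...' tracking query, so we split on both.
    """
    return url.split('playlist/')[1].split('?')[0]

playlist_id('https://open.spotify.com/playlist/37i9dQZF1DWY4xHQp97fN6')
# '37i9dQZF1DWY4xHQp97fN6'
```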
Using the Spotipy API wrapper, we pulled songs and corresponding audio features from each playlist. We stored the data in a dictionary of pandas dataframes.
for name in stored_playlists:
    link = stored_playlists[name]
    rows = []
    for song in sp.playlist_tracks(link)['items']:
        track = song['track']
        # audio_features takes a track URI and returns a list with one dict per track
        af = sp.audio_features(track['uri'])[0]
        rows.append({'track name': track['name'], 'artist': track['artists'][0]['name'],
                     'popularity': track['popularity'], 'danceability': af['danceability'],
                     'energy': af['energy'], 'key': af['key'], 'loudness': af['loudness'],
                     'mode': af['mode'], 'speechiness': af['speechiness'],
                     'acousticness': af['acousticness'], 'instrumentalness': af['instrumentalness'],
                     'liveness': af['liveness'], 'valence': af['valence'], 'tempo': af['tempo']})
    # DataFrame.append was removed in pandas 2.0, so we collect rows and build the frame once
    df = pd.DataFrame(rows, columns=['track name', 'artist', 'popularity', 'danceability', 'energy',
                                     'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                                     'instrumentalness', 'liveness', 'valence', 'tempo'])
    df = df.astype({'track name': 'string', 'artist': 'string',
                    **{col: 'float64' for col in df.columns[2:]}})
    stored_playlists[name] = df
We're going to create a 'master' dataframe with all of the playlist data for analysis. We also use it to fit the MinMaxScaler, so that the audio features whose ranges fall outside [0,1] (popularity, loudness, key, and tempo) are brought onto the same scale as the rest.
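As a minimal sketch of what the scaler does (toy tempo values in BPM, not our playlist data): MinMaxScaler learns the minimum and maximum from the data it is fit on and maps them linearly to 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy tempo column in BPM; the real analysis fits the scaler on all_playlists
tempos = np.array([[60.0], [90.0], [120.0]])
sc = MinMaxScaler(feature_range=(0, 1))
scaled = sc.fit_transform(tempos)
# the minimum maps to 0.0, the maximum to 1.0, and the midpoint to 0.5
```

Fitting on the combined data and then calling `transform` on each individual playlist (as below) keeps every playlist on one shared scale.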
all_playlists = pd.DataFrame(columns=['user', 'track name', 'artist', 'popularity', 'danceability',
                                      'energy', 'key', 'loudness', 'mode', 'speechiness',
                                      'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'])
all_playlists = all_playlists.astype({'user': 'string', 'track name': 'string', 'artist': 'string',
                                      **{col: 'float64' for col in all_playlists.columns[3:]}})
for k, v in stored_playlists.items():
    v['user'] = k
    all_playlists = pd.concat([all_playlists, v])

# fit the scaler on the combined data, then apply the same scaling to each playlist
sc = MinMaxScaler(feature_range=(0, 1))
all_playlists[['popularity', 'loudness', 'key', 'tempo']] = sc.fit_transform(
    all_playlists[['popularity', 'loudness', 'key', 'tempo']])
for k, v in stored_playlists.items():
    v[['popularity', 'loudness', 'key', 'tempo']] = sc.transform(v[['popularity', 'loudness', 'key', 'tempo']])
    stored_playlists[k] = v
all_playlists.describe()
Spotify provides these audio features to help quantify the character of audio tracks. We've created box plots to gauge the distribution of each feature.
From this plot, we can see that the playlists provided skew toward high popularity, energy, danceability, and loudness, and toward low acousticness, liveness, and speechiness. Tempo and valence are spread across a wide range.
columns = ['popularity', 'danceability', 'energy', 'valence', 'tempo',
           'acousticness', 'liveness', 'speechiness', 'loudness']
fig = go.Figure()
for col in columns:
    fig.add_trace(go.Box(y=all_playlists[col].values, name=col, boxpoints='all',
                         text=all_playlists['track name']))
fig.update_layout(
    title_text="All Playlists Audio Features",
    showlegend=True,
    paper_bgcolor="white",
    width=900
)
fig.show(renderer='notebook')
To compare playlists to one another, we've aggregated the audio features and displayed them on radar charts.
Each chart shows the average audio features of a user's playlist and gives insight into the type of music they like. From these charts we can roughly decipher which users like similar music, and how music tastes differ.
graph = ['cristinC_playlist', 'haleyS_playlist', 'mehulG_playlist', 'samH_playlist',
         'architM_playlist', 'top_songs_global', 'level_up', 'callyL_playlist']
fig = make_subplots(rows=4, cols=2, specs=[[{'type': 'polar'}] * 2] * 4)
colors = ["", "mediumseagreen", "darkorange", "mediumpurple", "magenta", "limegreen",
          "gold", "blue", "green", "red", "yellow"]
placement = [[], [1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [3, 2], [4, 1], [4, 2]]
for n, name in enumerate(graph, start=1):
    temp = stored_playlists[name][['popularity', 'danceability', 'energy', 'loudness',
                                   'valence', 'tempo', 'speechiness', 'acousticness']]
    fig.add_trace(
        go.Scatterpolar(
            r=temp.mean().values,
            theta=temp.columns,
            fill='toself',
            name=name,
            fillcolor=colors[n], opacity=0.3, line=dict(color=colors[n])
        ), row=placement[n][0], col=placement[n][1])
# hide the radial axis and fix its range on every polar subplot
fig.update_layout(**{('polar' if i == 1 else f'polar{i}'): dict(radialaxis=dict(visible=False, range=[0, 1]))
                     for i in range(1, 9)},
                  paper_bgcolor="white")
fig.update_layout(height=1000, width=1000, title_text="Playlist Audio Features", showlegend=True)
fig.show(renderer='notebook')
We can view the same data on a single radar chart, with the legend acting as a filter for specific users.
This chart makes it easier to compare two playlists at once, and also gives a good overview of all the playlists stacked on top of each other.
fig = go.Figure()
graph = ['cristinC_playlist', 'haleyS_playlist', 'samH_playlist', 'jennaS_playlist',
         'natalieH_playlist', 'top_songs_global', 'mehulG_playlist', 'huseinH_playlist',
         'level_up', 'callyL_playlist', 'get_turnt', 'mellow_bars', 'new_music_friday',
         'rap_caviar', 'pop_all_day', 'songs_to_sing_in_the_car']
colors = ["", "maroon", "rosybrown", "mediumseagreen", "darkorange", "mediumpurple", "magenta",
          "limegreen", "gold", "blue", "yellow", "red", "purple", "gold", "green", "orange", "pink", "cyan"]
for n, name in enumerate(graph, start=1):
    temp = stored_playlists[name][['popularity', 'danceability', 'energy', 'loudness',
                                   'valence', 'tempo', 'speechiness', 'acousticness']]
    fig.add_trace(
        go.Scatterpolargl(
            r=temp.mean().values,
            theta=temp.columns,
            name=name,
            marker=dict(size=15, color=colors[n])))
fig.update_traces(mode="markers", marker=dict(line_color='white', opacity=0.7))
fig.update_layout(
    title="Playlist Audio Features",
    font_size=15,
    showlegend=True,
    paper_bgcolor="white",
    # a single figure has only one polar subplot, so one 'polar' entry suffices
    polar=dict(radialaxis=dict(visible=False, range=[0, 1]))
)
fig.show(renderer='notebook')
Next, we visualize the correlations between the audio features with a scatter matrix.
scatter_df = all_playlists[['popularity','danceability','energy','loudness','valence','tempo','speechiness','acousticness']]
fig = px.scatter_matrix(scatter_df)
fig.show(renderer='notebook')
There's a strong positive correlation between 'energy' and 'loudness', and a strong negative correlation between 'energy' and 'acousticness'. We'll look at the energy-loudness relationship in more detail below.
fig = px.density_heatmap(all_playlists, x=all_playlists['energy'], y=all_playlists['loudness'], nbinsx=30, nbinsy=30, color_continuous_scale="YlGnBu")
fig.show(renderer='notebook')
We can cluster songs by their audio features using k-means and visualize the result with t-SNE. These clusters are our own homemade genres: they tell us which songs are most alike. The clusters aren't explicitly labeled 'pop' or 'rap'; rather, each is a collection of songs with similar audio features, which may well fall into the same conventional genre.
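A minimal sketch of the clustering step on toy feature vectors (illustrative numbers, not our playlist data) shows what the pipeline below computes: songs with similar features receive the same cluster label, and each label is one of our homemade genres.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# four toy songs described by (energy, danceability):
# two mellow tracks and two high-energy tracks
X = np.array([
    [0.20, 0.10],
    [0.25, 0.15],
    [0.90, 0.80],
    [0.85, 0.90],
])
pipe = Pipeline([('scaler', StandardScaler()),
                 ('kmeans', KMeans(n_clusters=2, n_init=10, random_state=0))])
labels = pipe.fit_predict(X)
# the two mellow songs share one label, the two energetic songs the other
```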
Dash is Plotly's framework for building interactive dashboards. Here, we use it to visualize the songs with different numbers of clusters and in either two or three dimensions.
First, we create dropdowns to configure the graph.
Then, in the update_graph callback, we recompute the k-means and t-SNE pipelines whenever a dropdown value changes, since changing a dropdown changes the clustering or the dimensionality.
import warnings
warnings.filterwarnings('ignore')

app = JupyterDash(__name__)
app.layout = html.Div([
    html.Div([
        html.Div([dcc.Dropdown(
            options={'2': '2 dimensions', '3': '3 dimensions'},
            value='2',
            id='dimensions'
        )]),  # style={'width': '48%', 'display': 'inline-block'}
        html.Div([dcc.Dropdown(
            options={'2': '2 clusters', '3': '3 clusters', '4': '4 clusters', '5': '5 clusters', '6': '6 clusters'},
            value='3',
            id='clusters'
        )]),  # style={'width': '48%', 'display': 'inline-block'}
        dcc.Graph(id='indicator-graphic'),
    ]),
])

@app.callback(Output('indicator-graphic', 'figure'),
              [Input('dimensions', 'value'), Input('clusters', 'value')])
def update_graph(dimensions, clusters):
    # drop the 'cluster' column (added by a previous callback) so it never leaks into the features
    X = all_playlists.select_dtypes(np.number).drop(columns=['cluster'], errors='ignore')
    cluster_pipeline = Pipeline([('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=int(clusters)))])
    cluster_pipeline.fit(X)
    all_playlists['cluster'] = cluster_pipeline.predict(X)
    tsne_pipeline = Pipeline([('scaler', StandardScaler()), ('tsne', TSNE(n_components=int(dimensions)))])
    genre_embedding = tsne_pipeline.fit_transform(X)
    if int(dimensions) == 2:
        projection = pd.DataFrame(columns=['x', 'y'], data=genre_embedding)
        projection.insert(0, 'cluster', all_playlists['cluster'].tolist(), True)
        projection.insert(0, 'name', all_playlists['track name'].tolist(), True)
        fig = go.Figure(data=[go.Scatter(x=projection['x'], y=projection['y'], mode='markers',
                                         marker=dict(color=projection['cluster'], colorscale='Viridis'))])
    else:
        projection = pd.DataFrame(columns=['x', 'y', 'z'], data=genre_embedding)
        projection.insert(0, 'cluster', all_playlists['cluster'].tolist(), True)
        projection.insert(0, 'name', all_playlists['track name'].tolist(), True)
        fig = go.Figure(data=[go.Scatter3d(x=projection['x'], y=projection['y'], z=projection['z'], mode='markers',
                                           marker=dict(color=projection['cluster'], colorscale='Viridis'))])
    fig.update_traces(hovertemplate=projection['name'])
    return fig

app.run_server(mode='inline', debug=True)
Within our corpus of songs, we can recommend tracks based on what your friends listen to.
First, we compute each playlist's center: the mean of its audio-feature vectors. Then, using cosine distance, we find the songs closest to that center.
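To make the ranking step concrete, here is a toy sketch (made-up feature values, not our playlist data) of how `cdist` with the `'cosine'` metric scores songs against a playlist center. Cosine distance is 1 minus cosine similarity, so smaller values mean more similar, and `np.argsort` puts the closest songs first.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# toy corpus standing in for all_playlists
columns = ['danceability', 'energy', 'valence']
corpus = pd.DataFrame({
    'track name':   ['song A', 'song B', 'song C'],
    'danceability': [0.90, 0.80, 0.10],
    'energy':       [0.80, 0.90, 0.90],
    'valence':      [0.70, 0.60, 0.05],
})

# mean feature vector of the listener's playlist (the 'center')
user_center = np.array([[0.85, 0.80, 0.70]])

# cosine distance from the center to every song; argsort ranks closest first
distances = cdist(user_center, corpus[columns], 'cosine')[0]
ranked = corpus.iloc[np.argsort(distances)]['track name'].tolist()
# song C points in a very different feature direction, so it ranks last
```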
users = ['cristinC_playlist', 'haleyS_playlist', 'callyL_playlist', 'samH_playlist', 'jennaS_playlist',
         'natalieH_playlist', 'architM_playlist', 'mehulG_playlist', 'huseinH_playlist',
         'top_songs_global', 'get_turnt', 'mellow_bars', 'new_music_friday', 'rap_caviar',
         'pop_all_day', 'songs_to_sing_in_the_car', 'level_up']
columns = ['popularity', 'danceability', 'energy', 'loudness', 'valence', 'tempo',
           'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'mode', 'key']
user_means = np.empty((0, len(columns)))
for user in users:
    temp = np.array(all_playlists.loc[all_playlists['user'] == user][columns].mean(), dtype=float)
    user_means = np.vstack((user_means, temp.reshape((1, len(columns)))))
cluster_df = pd.DataFrame(user_means)
cluster_df['user'] = users
def get_recs(user_center, user_name, all_playlists):
    current_songs = set(all_playlists.loc[all_playlists['user'] == user_name]['track name'])
    distances = cdist(user_center.reshape(1, -1), all_playlists[columns], 'cosine')
    index = list(np.argsort(distances)[:, :50][0])
    recs = all_playlists.iloc[index]
    # skip songs already on the user's playlist
    song_recs = [i for i in recs['track name'] if i not in current_songs]
    # dedupe with dict.fromkeys to preserve the distance ranking (a set would scramble it)
    return list(dict.fromkeys(song_recs))[:10]

for n, user in enumerate(user_means):
    if n > 8:
        break
    recs = get_recs(user, users[n], all_playlists)
    print(users[n].split('_')[0])
    print(recs)
    print()
For conciseness, we output only the top ten songs recommended for each user's playlist; their audio features closely match the aggregate features of that playlist.
The best 'true' measure of whether these recommendations are accurate is to input your own playlist and listen to them! The fun part is that the recommended songs all come from playlists your friends listen to.
In addition to finding new tracks based on playlist features, we can also figure out which playlists are most similar to one another.
def similar_playlists(user_center, user_name, user_means, users):
    distances = cdist(user_center.reshape(1, -1), user_means, 'cosine')
    # take the four nearest playlists, then drop the user's own, leaving the closest three
    index = list(np.argsort(distances)[:, :4][0])
    recs = [users[x] for x in index]
    return [i for i in recs if i != user_name][:3]

for n, user in enumerate(user_means):
    print(users[n])
    print(similar_playlists(user, users[n], user_means, users))
    print()
Above we output the closest three playlists to each playlist. We can also look at similar playlists with k-means clustering.
cluster_pipeline = Pipeline([('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=6))])
X = cluster_df.select_dtypes(np.number)
cluster_pipeline.fit(X)
cluster_df['cluster'] = cluster_pipeline.predict(X)
tsne_pipeline = Pipeline([('scaler', StandardScaler()), ('tsne', TSNE(n_components=3, verbose=False))])
users_embedding = tsne_pipeline.fit_transform(X)
projection = pd.DataFrame(columns=['x', 'y','z'], data=users_embedding)
projection.insert(0,'cluster', cluster_df['cluster'].tolist(), True)
projection.insert(0,'name', cluster_df['user'].tolist(), True)
fig = go.Figure(data=[go.Scatter3d(x = projection['x'],y = projection['y'],z=projection['z'],\
mode = 'markers',marker=dict(color=projection['cluster'],colorscale='Viridis'))])
fig.update_traces(hovertemplate=projection['name'])
fig.show(renderer='notebook')
cluster_df[['user','cluster']].sort_values(by='cluster')
Looking at these clusters, we can gauge their accuracy: users whose playlists fall into the same cluster share similar taste in music.
By pulling the audio features for every track in a Spotify listener's playlists, we can learn a lot about the listener and their friends.
In this report, we've explored how audio features are distributed, how songs cluster together, and how audio features can drive song recommendations. We've also compared users to one another through the aggregate features of their playlists. This Spotify music analysis is interactive and insightful, and it can be fit to new data simply by adding a link to a new playlist.